METHOD AND APPARATUS FOR AN
INTERNET PROTOCOL (IP) NETWORK
CLUSTERING SYSTEM

CROSS-REFERENCE TO RELATED
APPLICATIONS

This application is related to application Ser. No. 09/197,018
entitled "Method and Apparatus for TCP/IP load balancing
in an IP Network Clustering System," concurrently
filed Nov. 20, 1998, and still pending.

TECHNICAL FIELD

This invention relates to the field of Computer Systems in 
the general Network Communications sector. More
specifically, the invention is a method and apparatus for an
Internet Protocol (IP) Network clustering system. 



BACKGROUND ART 






As more and more businesses develop electronic com- 
merce applications using the Internet in order to market and 
to manage the ordering and delivery of their products, these 
businesses are searching for cost-effective Internet links that 
provide both security and high availability. Such mission- 
critical applications need to run all day, every day with the 
network components being highly reliable and easily scal- 
able as the message traffic grows. National carriers and local 
Internet Service Providers (ISPs) are now offering Virtual 
Private Networks (VPN) — enhanced Internet-based back- 
bones tying together corporate workgroups on far-flung 
Local Area Networks (LANs) — as the solution to these 
requirements. 

A number of companies have recently announced current
or proposed VPN products and/or systems which variously
support IPSec, IKE (ISAKMP/Oakley) encryption-key
management, as well as draft protocols for Point-to-Point
Tunneling Protocol (PPTP), and Layer 2 Tunneling Protocol
(L2TP) in order to provide secure traffic to users. Some of
these products include IBM's Nways Multiprotocol Routing
Services™ 2.2, Bay Networks Optivity™ and Centillion™
products, Ascend Communication's MultiVPN™ package,
Digital Equipment's ADI VPN product family, and Indus
River's RiverWorks™ VPN planned products. However,
none of these products are known to offer capabilities which
minimize delay and session loss by a controlled fail-over
process.

These VPNs place enormous demands on the enterprise
network infrastructure. Single points of failure components
such as gateways, firewalls, tunnel servers and other choke
points that need to be made highly reliable and scaleable are
being addressed with redundant equipment such as "hot
standbys" and various types of clustering systems.

For example, CISCO™ Inc. now offers a new product
called LocalDirector™ which functions as a front-end to a
group of servers and dynamically load balances TCP traffic
between servers to ensure timely access and response to
requests. The LocalDirector provides the appearance, to end
users, of a "virtual" server. For purposes of providing
continuous access if the LocalDirector fails, users are
required to purchase a redundant LocalDirector system
which is directly attached to the primary unit, the redundant
unit acting as a "hot" standby. The standby unit does no
processing work itself until the master unit fails. The
standby unit uses the failover IP address and the secondary
Media Access Control (MAC) address (which are the same
as those of the primary unit), thus no Address Resolution
Protocol (ARP) is required to switch to the standby unit. However,
because the standby unit does not keep state information on
each connection, all active connections are dropped and
must be re-established by the clients. Moreover, because the
"hot standby" does no concurrent processing it offers no
processing load relief nor scaling ability.

Similarly, Valence™ Research Inc. (recently purchased 
by Microsoft® Corporation) offers a software product called 
Convoy Cluster™ (Convoy). Convoy installs as a standard
Windows NT networking driver and runs on an existing 
LAN. It operates in a transparent manner to both server 
applications and TCP/IP clients. These clients can access the 
cluster as if it is a single computer by using one IP address. 
Convoy automatically balances the networking traffic 
between the clustered computers and can rebalance the load 
whenever a cluster member comes on-line or goes off-line.
However, this system appears to use a compute-intensive and
memory wasteful method for determining which message 
type is to be processed by which cluster member in that the 
message source port address and destination port address 
combination is used as an index key which must be stored 
and compared against the similar combination of each 
incoming message to determine which member is to process 
the message. Moreover, this system does not do failover. 

There is a need in the art for an IP network cluster system 
which can easily scale to handle the exploding bandwidth 
requirements of users. There is a further need to maximize 
network availability, reliability and performance in terms of 
throughput, delay and packet loss by making the cluster 
overhead as efficient as possible, because more and more 
people are getting on the Internet and staying on it longer. A 
still further need exists to provide a reliable failover system 
for TCP based systems by efficiently saving the state infor- 
mation on all connections so as to minimize packet loss and 
the need for reconnections. 

Computer cluster systems including "single-system- 
image" clusters are known in the art. See for example, 
"Scalable Parallel Computing" by Kai Hwang & Zhiwei Xu,
McGraw-Hill, 1998, ISBN 0-07-031798-4, Chapters 9 & 10, 
Pages 453-564, which are hereby incorporated fully herein 
by reference. Various Commercial Cluster System products 
are described therein, including DEC's Tru Clusters™ 
system, IBM's SP™ system, Microsoft's Wolfpack™ sys- 
tem and The Berkeley NOW Project. None of these systems 
are known to provide efficient IP Network cluster capability 
along with combined scalability, load-balancing and
controlled TCP fail-over.

SUMMARY OF THE INVENTION 

The present invention overcomes the disadvantages of the 
above-described systems by providing an economical, high- 
performance, adaptable system and method for an IP Net- 
work cluster. 

The present invention is an IP Network clustering system 
which can provide a highly scalable system which optimizes 
message throughput by adaptively load balancing its 
components, and which minimizes delay and packet loss 
especially in the TCP mode by a controlled fail-over process. 
No other known tunnel-server systems can provide this
combined scalability, load-balancing and controlled fail- 
over. 

The present invention includes a cluster apparatus com- 
prising a plurality of cluster members, with all cluster 
members having the same internet machine name and IP 
address, and each member having a general purpose 
processor, a memory unit, a program in the memory unit, a 
display and an input/output unit; and the apparatus having a
filter mechanism in each cluster member which uses a highly
efficient hashing mechanism to generate an index number
for each message session where the index number is used to
determine whether a cluster member is to process a particular
message or not. The index number is further used to
designate which cluster member is responsible for process- 
ing the message and is further used to balance the processing 
load over all present cluster members. 

The present invention further includes a method for
operating a plurality of computers in an IP Network cluster
which provides a single-system-image to network users, the
method comprising steps to interconnect the cluster
members, and assigning all cluster members the same internet
machine name and IP address whereby all cluster members
can receive all messages arriving at the cluster and all
messages passed on by the members of the cluster appear to
come from a single unit, and to allow them to communicate
with each other; to adaptively designate which cluster member
will act as a master unit in the cluster; and the method
providing a filter mechanism in each cluster member which
uses a highly efficient hashing mechanism to generate an
index number for each message session where the index
number is used to determine whether a cluster member is to
process a particular message or not. The index number is
further used to designate which cluster member is responsible
for processing which message type and is further used
to balance the processing load over all present cluster
members.

Other embodiments of the present invention will become
readily apparent to those skilled in these arts from the 
following detailed description, wherein is shown and 
described only the embodiments of the invention by way of 
illustration of the best mode known at this time for carrying 
out the invention. The invention is capable of other and 
different embodiments some of which may be described for 
illustrative purposes, and several of the details are capable of 
modification in various obvious respects, all without departing
from the spirit and scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS 

The features and advantages of the system and method of 
the present invention will be apparent from the following 
description in which: 

FIG. 1 illustrates a typical Internet network configuration.

FIG. 2 illustrates a representative general purpose
computer/cluster-member configuration.

FIG. 3 illustrates a representative memory map of data
contained on a related Flash Memory card.

FIG. 4 illustrates a typical IP Network cluster.

FIG. 5 illustrates a general memory map of the preferred
embodiment of a cluster member acting as a tunnel-server.

FIG. 6 illustrates a flow-chart of the general operation of
the cluster indicating the cluster establishment process.

FIG. 7 illustrates an exemplary TCP state data structure.

FIGS. 8A-8I illustrate flow-charts depicting the events
which the master processes and the events which the non-
master cluster members (clients) must process.

FIG. 9 illustrates a flow-chart depicting the normal
packet handling process after establishing the cluster.

BEST MODE FOR CARRYING OUT THE 
INVENTION 

A method and apparatus for operating an Internet Protocol 
(IP) Network cluster is disclosed. In the following description,
for purposes of explanation, specific data and configurations
are set forth in order to provide a thorough understanding
of the present invention. In the presently preferred
embodiment the IP Network cluster is described in terms of 
a VPN tunnel-server cluster. However, it will be apparent to 
one skilled in these arts that the present invention may be 
practiced without the specific details, in various applications 
such as a firewall cluster, a gateway or router cluster, etc. In 
other instances, well-known systems and protocols are 
shown and described in diagrammatical or block diagram 
form in order not to obscure the present invention unnec- 
essarily. 

Operating Environment 

The environment in which the present invention is used 
encompasses the general distributed computing scene which 
includes generally local area networks with hubs, routers, 
gateways, tunnel-servers, applications servers, etc.
connected to other clients and other networks via the Internet,
wherein programs and data are made available by various 
members of the system for execution and access by other 
members of the system. Some of the elements of a typical 
internet network configuration are shown in FIG. 1, wherein 
a number of client machines 105 possibly in a branch office
of an enterprise, are shown connected to a Gateway/hub/ 
tunnel-server/etc. 106 which is itself connected to the inter- 
net 107 via some internet service provider (ISP) connection 
108. Also shown are other possible clients 101, 103 similarly 
connected to the internet 107 via an ISP connection 104, 
with these units communicating to possibly a home office via 
an ISP connection 109 to a gateway/tunnel-server 110 which 
is connected 111 to various enterprise application servers 
112, 113, 114 which could be connected through another 
hub/router 115 to various local clients 116, 117, 118.

The present IP Network cluster is made up of a number of 
general purpose computer units each of which includes 
generally the elements shown in FIG. 2, wherein the general
purpose system 201 includes a motherboard 203 having
thereon an input/output ("I/O") section 205, one or more
central processing units ("CPU") 207, and a memory section 
209 which may have a flash memory card 211 related to it. 
The I/O section 205 is connected to a keyboard 226, other 
similar general purpose computer units 225, 215, a disk 
storage unit 223 and a CD-ROM drive unit 217. The 
CD-ROM drive unit 217 can read a CD-ROM medium 219 
which typically contains programs 221 and other data. Logic 
circuits or other components of these programmed comput- 
ers will perform series of specifically identified operations 
dictated by computer programs as described more fully 
below. 

Flash memory units typically contain additional data used 
for various purposes in such computer systems. In the 
preferred embodiment of the present invention, the flash 
memory card is used to contain certain unit "personality" 
information which is shown in FIG. 3. Generally the flash 
card used in the current embodiment contains the following
types of information:

Cryptographically signed kernel (301)

Configuration files (such as cluster name, specific unit IP
address, cluster address, routing information configuration,
etc.) (303)

Pointer to error message logs (305)

Authentication certificate (307)

Security policies (for example, encryption needed or not,
etc.) (309)

The Invention 

The present invention is an Internet Protocol (IP) clustering
system which can provide a highly scalable system
which optimizes throughput by adaptively load balancing its
components, and which minimizes delay and session loss by
a controlled fail-over process. A typical IP cluster system of
the preferred embodiment is shown in FIG. 4 wherein the
internet 107 is shown connected to a typical IP cluster 401
which contains programmed general purpose computer units
403, 405, 407, 409 which act as protocol stack processors for
message packets received. The IP cluster 401 is typically
connected to application servers or other similar type units
411 in the network. In this figure it is shown that there
[illegible] purposes of further illustration the cluster will be
depicted as having three units, understanding that the cluster
of the present invention is not limited to only three units.
Also for purposes of illustration the preferred embodiment
will be described as a cluster whose applications may be VPN
tunnel protocols; however, it should be understood that this
cluster invention may be used as a cluster whose application
is to act as a Firewall, or to act as a gateway, or to act as a
security device, etc.

In the preferred embodiment of the present invention,
each of the cluster members is a computer system having an
Intel motherboard, two Intel Pentium™ processors, a 64
megabyte memory and two Intel Ethernet controllers, and
two HiFn cryptographic processors. The functions performed
by each processor are generally shown by reference
to the general memory map of each processor as depicted in
FIG. 5. Each cluster member has an Operating System
kernel 501, TCP/IP stack routines 503 and various cluster
management routines (described in more detail below) 505,
program code for processing application #1 507, which in
the preferred embodiment is code for processing the IPSec
protocol, program code for processing application #2 509,
which in the preferred embodiment is code for processing
the PPTP protocol, program code for processing application
#3 511, which in the preferred embodiment is code for
processing the L2TP protocol, and program code for processing
application #4 513, which in the preferred embodiment
is code space for processing an additional protocol
such as perhaps a "Mobile IP" protocol. Detailed information
on these protocols can be found through the home page
of the IETF at "http://www.ietf.org". The following specific
protocol descriptions are hereby incorporated fully herein by
reference:

"Point-to-Point Tunneling Protocol-PPTP", Glen Zorn,
G. Pall, K. Hamzeh, W. Verthein, J. Taarud, W. Little, Jul.
28, 1998.

[Title and leading authors illegible] Palter, T. Kolar, G. Pall,
M. Littlewood, A. Valencia, K. [illegible] Mark Townsley,
May 22, 1998.

Kent, S., Atkinson, R., "IP Authentication Header,"
draft-ietf-ipsec-auth-header-07.txt.

Kent, S., Atkinson, R., "Security Architecture for the
Internet Protocol," draft-ietf-ipsec-arch-sec-07.txt.

Kent, S., Atkinson, R., "IP Encapsulating Security Payload
(ESP)," draft-ietf-ipsec-esp-v2-06.txt.

Pereira, R., Adams, R., "The ESP CBC-Mode Cipher
Algorithms," draft-ietf-ipsec-ciph-cbc-04.txt.

Glenn, R., Kent, S., "The NULL Encryption Algorithm
and Its Use With IPsec," draft-ietf-ipsec-ciph-null-01.txt.

Madson, C., Doraswamy, N., "The ESP DES-CBC Cipher
Algorithm With Explicit IV," draft-ietf-ipsec-ciph-des-expiv-02.txt.

Madson, C., Glenn, R., "The Use of HMAC-MD5 within
ESP and AH," draft-ietf-ipsec-auth-hmac-md5-96-03.txt.

Madson, C., Glenn, R., "The Use of HMAC-SHA-1-96
within ESP and AH," draft-ietf-ipsec-auth-hmac-sha196-03.txt.

Harkins, D., Carrel, D., "The Internet Key Exchange
(IKE)," draft-ietf-ipsec-isakmp-oakley-08.txt.

Maughan, D., Schertler, M., Schneider, M., and Turner, J.,
"Internet Security Association and Key Management
Protocol (ISAKMP)," draft-ietf-ipsec-isakmp-10.{ps,txt}.

H. K. Orman, "The OAKLEY Key Determination
Protocol," draft-ietf-ipsec-oakley-02.txt.

Piper, D., "The Internet IP Security Domain of Interpretation
for ISAKMP," draft-ietf-ipsec-ipsec-doi-10.txt.

Tunneling protocols such as the Point-to-Point Tunneling
Protocol (PPTP) and Layer 2 Tunneling Protocol (L2TP),
although currently only "draft" standards, are expected to be
confirmed as official standards by the Internet Engineering
Task Force (IETF) in the very near future, and these protocols
together with the Internet Security Protocol (IPSec),
provide the basis for the required security of these VPNs.

Referring again to FIG. 5, the preferred embodiment in a
cluster member also contains a work assignment table 515
for containing the message/session work-unit hash numbers
and the cluster member id assigned to that work-unit; a table
containing the application state table for this cluster member
517; a similar application state table for the other members
of the cluster 519; an area for containing incoming messages
[illegible] data messages [illegible] members of the cluster 523.
Those skilled in the art will recognize that various other
routines and message stores can be implemented in such a
cluster member's memory to perform a variety of applications.

The general operation of the preferred embodiment of the
IP cluster is now described in terms of (1) cluster establishment
(FIG. 6) including processes for members joining the
cluster and leaving the cluster; (2) master units events
processing (FIGS. 8A-8F) and client units events processing
(FIGS. 8G-8I); and (3) finally, normal message processing
activity (FIG. 9).

Referring now to FIG. 6 the cluster establishment activity
is depicted. At system start-up 601 cluster members try to
join the cluster by sending (broadcasting) a "join request"
message 603. This "join" message contains an authentication
certificate obtained from a valid certificate authority.
When the master unit receives this "join" message it checks the
certificate against a list of valid certificates which it holds
and if it finds no match it simply tells him the join has failed.
Note that normally when a system administrator plans to add
a hardware unit to an existing cluster, he requests that his
security department or an existing security certificate authority
issue a certificate to the new unit and send a copy of the
certificate to the master unit in the cluster. This process
guarantees that someone could not illegally attach a unit to
a cluster to obtain secured messages. If the master unit does
match the certificate from the join message with a certificate
it holds in its memory it sends an "OK to join" message. If
an "OK to join" message is received 605 then this unit is
designated a cluster member (client or non-master) 607.
Note that each cluster member has a master-watchdog timer
(i.e. a routine to keep track of whether the member got a
[illegible], say within the last 200 milliseconds) and if the
timer expires (i.e. no keepalive message from the master
during the interval) it will mean that the master unit is dead
607 and the cluster member/client will try to join the cluster
again (611). Another event that will cause the cluster
member/client 607 to try to join up again is if it gets an
"exit request" message
(i.e. telling it to "leave the cluster") 609. If the member
sending out the join request message (603) does not get an
"OK to join" message 613 the member sends out
(broadcasts) packets offering to become the master unit 615.
If the member gets an "other master exists" message 617 the
member tries to join again 603. If after the member sends out
the packets offering to become the master, he gets no
response for 100 milliseconds 619 he sends broadcast
Address Resolution Protocol (ARP) responses to tell anyone
on the network what Ethernet address to use for the cluster
IP address 621 and now acts as the cluster master unit 623.
If in this process the cluster member got no indication that
another master exists (at 617) and now thinking it is the only
master 623 but yet gets a message to "exit the cluster" 641
the member must return to try to join up again 642. This
could happen for example, if this new master's configuration
version was not correct. He would return, have an updated
configuration and attempt to rejoin. Similarly, if this member
who thinks he is the new master 623 gets a "master keepalive"
message 625 (indicating that another cluster member
thinks he is the master unit) then he checks to see if
somehow the master keepalive message was from him 627
(normally the master doesn't get his own keepalive messages
but it could happen) and if so he just ignores the
message 639. If however the master keepalive message was
not from himself 629 it means there is another cluster
member who thinks he is the master unit and somehow this
"tie" must be resolved. (This tie breaker process is described
in more detail below with respect to "Master event"
processing). If the tie is resolved in favor of the new cluster
member who thinks he is the master 635 he sends an "Other
master exists" message to the other master and once again
sends broadcast Address Resolution Protocol (ARP)
responses to tell anyone on the network what Ethernet
address to use for the cluster IP address 637 (because that
other master could have done the same). If this new cluster
member who thinks he is the master loses the tie-breaker 633
then he must go and join up again to try to get the cluster
stabilized. This process produces a single cluster member
acting as the master unit and the other cluster members
understanding they are merely members.

Master Unit Events Processing

After a cluster has formed, there are various events that
occur which the master unit must address. How these are
handled in the preferred embodiment is now described with
reference to FIGS. 8A-8F. Referring to FIG. 8A the first
master unit event describes the "tie-breaker" process when
two cluster members claim to be the "master" unit. Recall
from above that the master normally does not receive his
own "keepalive" message so that if a master gets a "master
keepalive" message 801 it likely indicates that another
cluster member thinks he is the master. In the preferred
embodiment, the "master keepalive" message contains the
cluster member list, the adaptive keepalive interval (which
is described in more detail below) and the current set of
work assignments for each member which is used only for
diagnostic purposes. So when a master gets a master keepalive
message 801 he first asks "is it from me?" 803 and if
so he just ignores this message 807 and exits 808. If the
master keepalive message is not from this master unit 804
then the "tie-breaker" process begins by asking "Do I have
more cluster members than this other master?" 809 If this
master does then he sends an "other master exists" message
825 telling the other master to relinquish the master role and
rejoin the cluster. The remaining master then once again
sends broadcast Address Resolution Protocol (ARP)
responses to tell anyone on the network what Ethernet
address to use for the cluster IP address 827 and exits 808.
If the current master does not have more cluster members
than this other master 811 he asks "do I have less cluster
members than the other master?" 813 and if so 816 he must
give up the master role to the other one by exiting the cluster
821 and rejoining the cluster as a member/non-master 823,
exiting to 601 in FIG. 6. If the current master does not have
less members than the other master 815 (which indicates
that they have the same number) then the final tie-breaker
[illegible] and if so then again the current master wins the
tie-breaker 818 [illegible] 825. [Illegible] tie-breaker 819
then he exits [illegible] as a non-master member 821.

Referring now to FIG. 8B another master event occurs
when the master gets a "client keepalive message" (that is
one from a non-master cluster member) 830. The master
asks "is this client in my cluster?" 831 and if not the master
sends the client an "exit cluster" message 833 telling the
client to exit from this cluster. If the client is from this
master's cluster the master calculates and stores a packet
loss average value using the sequence number of the client
keepalive message and the calculated adaptive keepalive
interval 835. The master then resets the watchdog timer for
this client 837. The watchdog timer routine is an operating
system routine that checks a timer value periodically to see
if the failover detection interval has elapsed since the value
was last reset and if so the watchdog timer is said to have
expired and the system then reacts as if the client in question
has left the cluster and reassigns that client's work-load to the
remaining cluster members.

As indicated above, the master periodically sends out a
master keepalive message containing the cluster member
list, the adaptive keepalive interval (which is described in
more detail below) and the current set of work assignments
for each member which is used only for diagnostic purposes.
(See FIG. 8C). In addition, the master periodically (in the
preferred embodiment every 2 seconds) checks the load-balance
of the cluster members. In FIG. 8D when the timer
expires 855 the master [illegible] between the most loaded
and least loaded cluster member 857 and then asks "would
moving a work unit from most loaded (K) to least loaded (J)
have any effect?" [illegible] 859. If so then the master sends
a "work de-assign" [illegible] member with the least loaded
member as [illegible] and the master [illegible] the load
[illegible]. If the result of moving 1 [illegible] load
[illegible] the master makes no reassignments [illegible].

Another master event occurs when a watchdog timer for
a client/cluster member expires wherein the master deletes
that client from the cluster data list and the deleted unit's
work goes into a pool of unassigned work to get reassigned
normally as the next message arrives. (See FIG. 8E).

Referring now to FIG. 8F another master event in the
preferred embodiment occurs when the master gets a client
join request message 875. The master initially tells the client
to wait by sending a NAK with an "operation in progress"
reason 877. The master then notifies the applications that are
present that a client is trying to join the cluster as some
applications want to know about it 879. For example if
IPSec is one of the applications then IPSec may want to
validate this client before agreeing to let it join the cluster.
If any application rejects the join request the master sends a
NAK with the reason 855 and exits. If all applications
approve the join request the master sends an ACK and the
join proceeds as normal 887.
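The decision taken in the master "tie-breaker" process of FIG. 8A can be summarized in a few lines. The following Python sketch is illustrative only and is not taken from the specification: the names TieResult and resolve_master_tie are hypothetical, and because the test used for the final tie-breaker is not legible in this copy, it is represented by a caller-supplied flag (wins_final_tiebreak) rather than by any particular comparison.

```python
from enum import Enum


class TieResult(Enum):
    # Winner sends "other master exists" and re-broadcasts ARP responses.
    KEEP_MASTER = "keep master role"
    # Loser exits the cluster and rejoins as a non-master member.
    REJOIN = "exit and rejoin as non-master"


def resolve_master_tie(my_members: int, other_members: int,
                       wins_final_tiebreak: bool) -> TieResult:
    """Decide what a master does after a foreign "master keepalive".

    wins_final_tiebreak is a hypothetical stand-in for the final
    tie-breaker test, consulted only when both masters report the
    same number of cluster members.
    """
    if my_members > other_members:
        return TieResult.KEEP_MASTER
    if my_members < other_members:
        return TieResult.REJOIN
    # Equal membership: fall through to the final tie-breaker.
    return TieResult.KEEP_MASTER if wins_final_tiebreak else TieResult.REJOIN
```

Because both masters evaluate the same rule on the same two member counts, exactly one of them keeps the role whenever the counts differ; the final tie-breaker must likewise be a test on which the two units cannot both win.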
Client Cluster Member Events

The non-master cluster members (clients) must also send
keepalive messages and monitor the watchdog timer for the
master. Referring now to FIG. 8G, when a client gets a
master keepalive message 890 it updates its adaptive
keepalive interval 891, and checks the list of cluster members
to see if any members have been lost 893. If so this client
notifies its applications that a cluster member has departed
895 (for example, IPSec wants to know). The client also
checks to see if any members have been added to the cluster
897 and if so notifies the applications 898 and finally resets
the watchdog timer for monitoring the master 899 and exits.
Each client also has a periodic timer which is adaptive to the
network packet loss value sent by the master which requires
the client to send a client keepalive message (containing a
monotonically increasing numeric value) to the master
periodically (See FIG. 8H). Also each client has a master
watchdog timer it must monitor and if it expires the client
must exit the cluster and send a new join message to re-enter
the cluster. (See FIG. 8I).
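The client-side behavior of FIGS. 8G-8I can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not the implementation of the preferred embodiment: the class and method names are hypothetical, time is passed in explicitly as a number of seconds so the logic is easy to follow, and the 0.2 second timeout is only an example value.

```python
class MasterWatchdog:
    """Tracks how long ago the last master keepalive was seen."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_reset = now

    def reset(self, now: float) -> None:
        self.last_reset = now

    def expired(self, now: float) -> bool:
        return (now - self.last_reset) > self.timeout_s


class ClusterClient:
    """Hypothetical non-master member handling the events of FIGS. 8G-8I."""

    def __init__(self, members, timeout_s: float = 0.2) -> None:
        self.members = set(members)
        self.keepalive_interval = None
        self.sequence = 0
        self.watchdog = MasterWatchdog(timeout_s)

    def on_master_keepalive(self, now, member_list, keepalive_interval):
        """FIG. 8G: update interval, diff the member list, reset watchdog."""
        self.keepalive_interval = keepalive_interval
        new_members = set(member_list)
        lost = self.members - new_members    # applications are told of these
        added = new_members - self.members   # and of these
        self.members = new_members
        self.watchdog.reset(now)
        return lost, added

    def next_client_keepalive(self) -> int:
        """FIG. 8H: keepalives carry a monotonically increasing value."""
        self.sequence += 1
        return self.sequence

    def must_rejoin(self, now: float) -> bool:
        """FIG. 8I: an expired master watchdog forces a new join request."""
        return self.watchdog.expired(now)
```

A caller would invoke on_master_keepalive whenever a master keepalive arrives and poll must_rejoin from its periodic timer; the returned lost and added sets are what the applications (for example IPSec) would be notified about.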

Normal IP Packet Processing 

In order for a cluster member to correctly process only its 
share of the workload, one of three methods is used: 

1. The MAC address of the master is bound to the cluster 
IP address (using the ARP protocol). The master applies the 
filtering function (described in more detail below) to classify 
the work and forward each packet (if necessary) to the 
appropriate cluster member. 

2. A cluster-wide Unicast MAC address is bound to the
cluster IP address (using the ARP protocol). Each cluster
member programs its network interface to accept packets
with this MAC destination address. Now each cluster member
can see all packets with the cluster IP address destination.
Each member applies the filtering function and discards
packets that are not part of its workload.

3. Method 2 is used but with a Multicast MAC address
instead of a Unicast MAC address. This method is required 
when intelligent packet switching devices are part of the 
network. These devices learn which network ports are 
associated with each Unicast MAC address when they see 
packets with a Unicast MAC destination address, and they 
only send the packets to the port the switching device has 
determined is associated with that MAC address (only 1 port 
is associated with each Unicast MAC address). A Multicast 
MAC address will cause the packet switching device to 
deliver packets with the cluster IP destination address to all 
cluster members. 
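The distinction between methods 2 and 3 rests on how a MAC address is classified. Under the IEEE 802 addressing convention, the least significant bit of the first octet marks a group (multicast) address. The short Python sketch below shows that check; the function name and the sample addresses are illustrative assumptions and do not come from the specification.

```python
def is_multicast_mac(mac: str) -> bool:
    """Return True if the colon-separated MAC address is a group address.

    The group/individual bit is the least significant bit of the
    first octet of the address.
    """
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)


# Illustrative addresses only: a group address and an individual address.
cluster_mac_method_3 = "01:00:5e:00:00:01"
cluster_mac_method_2 = "00:a0:c9:12:34:56"
```

A learning switch binds an individual (Unicast) address to a single port, which is why method 3 selects an address for which is_multicast_mac is true when such switches are present.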

In the preferred embodiment, there is a mechanism for
designating which cluster member is to process a message
and allowing the other members to disregard the message
without inadvertently sending a "reset" message to the
originating client. The preferred embodiment makes use of
a "filter" process in each cluster member which calculates a
hash function using certain fields of the incoming message
header. This hash calculation serves as a means of both
assigning a work unit number to a message and assigning a
work unit to a particular cluster member for processing. This
technique allows a cluster member to tell whether the
incoming message must be processed by it, so the
possibility of an inadvertent "reset" message is precluded. It
is noted that other solutions to this problem of "how to get
the work to the right member of the cluster with minimum
overhead" could include a hardware filter device sitting
between the network and the cluster wherein the hardware
filter would do the member assignment and load balancing
function. Note that since all cluster members have the same
MAC address, all cluster members get all messages, and the
way they tell whether they must process the message further
is to calculate the work unit number using the hashing
method shown above and then to check the resulting work
unit number against their work load table to see if it is
assigned to them. If not, they dump the message from their
memory. This is a fast and efficient scheme for dumping
messages that the units need not process further, and it also
provides an efficient basis for load-balancing and for
fail-over handling when a cluster member fails.
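The work load table check described above can be sketched as follows. The table layout, the UNASSIGNED marker and the function names are assumptions made for illustration; the specification states only that each member checks the work unit number against its work load table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_WORK_UNITS 1024   /* work unit numbers run from 0 to 1023 */
#define UNASSIGNED     (-1)   /* hypothetical marker: no owner yet */

/*
 * Hypothetical work load table: entry w holds the ID of the cluster member
 * to which work unit w is assigned.
 */
struct workload_table {
    int owner[NUM_WORK_UNITS];
};

/* Mark every work unit as unassigned. */
static void workload_table_init(struct workload_table *t)
{
    assert(t != NULL);
    for (size_t w = 0; w < NUM_WORK_UNITS; w++)
        t->owner[w] = UNASSIGNED;
}

/*
 * Returns true if this member must process the message further, and false
 * if the message may be dumped from memory.
 */
static bool must_process(const struct workload_table *t,
                         unsigned work_unit, int my_id)
{
    assert(t != NULL && work_unit < NUM_WORK_UNITS);
    return t->owner[work_unit] == my_id;
}
```

Because the check is a single array lookup, a member that does not own the work unit can discard the message at very low cost.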

The normal processing of IP packets is described with
reference to FIG. 9. Upon the receipt of a packet 901, a
determination is made as to whether the packet is addressed
to a cluster IP address 903 or not. If not 905, then it is
determined whether the IP address is for this cluster member,
and if so the packet is processed by the IP stack locally 909.
If the packet is to be forwarded (here the system is acting
like a router) 908, a forward filter is applied in order to
classify the work 913. This designates whether the packet is
for normal work for the cluster clients or is forwarding work.
If at step 903, where the address was checked to see if it was
a cluster IP address, the answer was yes, then a similar work
set filter is applied 911 wherein the IP source and destination
addresses are hashed modulo 1024 to produce an index value
which is used for various purposes. This index value
calculation (the processing filter) is required in the current
embodiment and is described more fully as follows:


Basically, the fields containing the IP addresses, IP
protocol, and TCP/UDP port numbers, and, if the application
is L2TP, the session and tunnel ID fields, are all added
together (logical XOR) and then shifted to produce a unique
"work unit" number between 0 and 1023.

For example, in the preferred embodiment the index could 
be calculated as follows: 



/*
 * Sample Cluster Filtering function
 */
static int Cluster_Filtering_Function(void *Packet, int Forwarding)
{
    struct ip *ip = (struct ip *)Packet;
    int i, length;

    /*
     * Select filtering scheme based on whether or not we are
     * forwarding this packet
     */
    if (Forwarding) {
        /*
         * Filter Forwarded packets on source & destination
         * IP address
         */
        i = ip->ip_dst.s_addr;
        i ^= ip->ip_src.s_addr;
    } else {
        /*
         * Not forwarding: Put in the IP source address
         */
        i = ip->ip_src.s_addr;
        /*
         * Get the packet header length and dispatch on protocol
         */
        length = ip->ip_hl << 2;
        if (ip->ip_p == IPPROTO_UDP) {
            /*
             * UDP: Hash on UDP Source Port and Source IP
             * Address
             */
            i ^= ((struct udphdr *)((char *)ip + length))->uh_sport;
        } else if (ip->ip_p == IPPROTO_TCP) {
            /*
             * Hash on the TCP Source Port and Source IP Address
             */
            i ^= ((struct tcphdr *)((char *)ip + length))->th_sport;
        } else {
            /*
             * Any other protocol: Hash on the Destination and
             * Source IP Addresses
             */
            i ^= ip->ip_dst.s_addr;
        }
    }

    /*
     * Collapse it into a work-set number
     */
    return (IP_CLUSTER_HASH(i));
}
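The sample function relies on IP_CLUSTER_HASH, whose definition is not given in this section. The following self-contained sketch shows one way the final collapse into a work-set number between 0 and 1023 might be done. The XOR-fold in cluster_hash(), the flow_key structure and the name work_set_for() are all assumptions made for illustration; they are not the macro or the data layout of the preferred embodiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORK_SET_BITS  10u
#define WORK_SET_COUNT (1u << WORK_SET_BITS)   /* 1024 work sets */

/* Hypothetical pre-parsed header fields, already extracted from the packet. */
struct flow_key {
    uint32_t src_addr;  /* IP source address */
    uint32_t dst_addr;  /* IP destination address */
    uint16_t src_port;  /* TCP or UDP source port; unused otherwise */
    uint8_t  protocol;  /* IP protocol number: 6 = TCP, 17 = UDP */
};

/*
 * Assumed stand-in for IP_CLUSTER_HASH: XOR-fold the 32-bit value down to
 * 10 bits so that every input bit contributes to the result.
 */
static unsigned cluster_hash(uint32_t v)
{
    v ^= v >> 20;
    v ^= v >> 10;
    return (unsigned)(v & (WORK_SET_COUNT - 1u));
}

/* Mirrors the branch structure of the sample filtering function. */
static unsigned work_set_for(const struct flow_key *k, int forwarding)
{
    uint32_t i;

    assert(k != NULL);
    if (forwarding) {
        /* Forwarded packets: source and destination IP addresses. */
        i = k->dst_addr ^ k->src_addr;
    } else {
        i = k->src_addr;
        if (k->protocol == 17u || k->protocol == 6u)
            i ^= k->src_port;      /* UDP or TCP: add the source port */
        else
            i ^= k->dst_addr;      /* other protocols: add the destination */
    }
    return cluster_hash(i);
}
```

Every packet of a given session produces the same key and therefore the same work-set number, which is what allows each member to decide independently whether the packet is its own.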



Referring again to FIG. 9, once the work set index value has
been calculated, each member making this calculation uses
the index value as an indirect pointer to determine whether
this work set is its assigned work set 915, 917. If the index
value does not indicate that this work set has been assigned
to this cluster member, and this cluster member is not the
cluster master, then the packet is simply dropped by this
cluster member 921, 923, 925. If, on the other hand, this
cluster member is the master unit 926, then the master must
check to see if this work set has been assigned to one of the
other cluster members for processing 927. If it has been
assigned to another cluster member 929, the master checks to
see if that cluster member has acknowledged receiving the
assignment 931, and if so the master checks to see whether it
is in the multicast mode or unicast/forwarding mode 933, 935.
If it is in the unicast or multicast mode, the master drops the
packet because the assigned cluster member would have
seen it 936. If, however, the master is in the forwarding
mode, the master will forward the packet to the assigned
member for processing 943. If the assigned cluster member
has not yet acknowledged receiving the assignment 940, then
the master saves the packet until the member does
acknowledge the assignment 941 and then forwards the packet
to that member to process 943. If, when the master checked to
see if this work set had been assigned at 927, the answer is
no 928, then the master will assign this work set to the least
loaded member 937 and then resume its previous task 939
until the assigned member acknowledges receipt of the
assignment as described above. If the work is for this
member, the packet is passed on to the local TCP/IP stack.
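The decision sequence of FIG. 9 described above can be summarized as a small classification routine. This is a sketch only: the enumerations, the work_set structure and the function name classify() are hypothetical, and the queueing, forwarding and assignment actions themselves are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How packets reach the cluster members (methods 1, 2 and 3). */
enum delivery_mode { MODE_FORWARDING, MODE_UNICAST, MODE_MULTICAST };

/* What a member does with a packet addressed to the cluster. */
enum action {
    ACT_PROCESS_LOCALLY,   /* pass to the local TCP/IP stack */
    ACT_DROP,              /* another member owns it or has seen it */
    ACT_FORWARD,           /* master forwards to the assigned member */
    ACT_HOLD_UNTIL_ACK,    /* master saves the packet until the ack */
    ACT_ASSIGN             /* master assigns the work set first */
};

/* Hypothetical per-work-set record kept by each member. */
struct work_set {
    int  owner;            /* member ID, or -1 if unassigned */
    bool acknowledged;     /* owner has acknowledged the assignment */
};

static enum action classify(const struct work_set *ws, int my_id,
                            bool i_am_master, enum delivery_mode mode)
{
    assert(ws != NULL);
    if (ws->owner == my_id)
        return ACT_PROCESS_LOCALLY;
    if (!i_am_master)
        return ACT_DROP;
    if (ws->owner < 0)
        return ACT_ASSIGN;
    if (!ws->acknowledged)
        return ACT_HOLD_UNTIL_ACK;
    /* In unicast or multicast mode the owner has already seen the packet. */
    return mode == MODE_FORWARDING ? ACT_FORWARD : ACT_DROP;
}
```

Only the master ever reaches the last three outcomes; every other member either processes the packet or drops it after a single comparison.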
State Maintenance 

RFC 1180, A TCP/IP Tutorial, T. Socolofsky and C. Kale,
January 1991, generally describes the TCP/IP protocol suite
and is incorporated fully herein by reference. In the present
invention, a key element is the ability to separate the TCP
state into an essential portion of the state and a calculable
portion of the state. For example, the state of a TCP message
changes constantly, and accordingly it would not be practical
for a cluster member to transfer all of this TCP state to all
of the other members of the cluster each time the state
changed. This would require an excessive amount of storage
and processing time and would essentially double the traffic
to the members of the cluster. The ability of the member
units to maintain the state of these incoming messages is
critical to their ability to handle the failure of a member unit
without requiring a reset of the message session. FIG. 7
depicts the preferred embodiment's definition of which
elements of the TCP state are considered essential and
therefore must be transferred to each member of the cluster
701 when they change, and which elements of the TCP state
are considered to be calculable from the essential state 703
and therefore need not be transferred to all members of the
cluster when they change. The TCP Failover State 700 in the
present embodiment actually comprises three portions: an
Initial State portion 702, which only needs to be sent once to
all cluster members; the Essential State portion 701, which
must be sent to all cluster members for them to store when
any item listed in the Essential portion changes; and the
Calculable State portion 703, which is not sent to all
members. The data to the right of the equals sign ("=") for
each element indicates how to calculate that element's value
whenever it is needed.
Failover Handling 

As indicated above, the preferred embodiment of the IP
cluster apparatus and method also includes the ability to
monitor each cluster member's operation in order to manage
the cluster operation for optimal performance. This means
ensuring that the cluster system recognizes quickly when a
cluster member becomes inoperative for any reason, as well
as having a reasonable process for refusing to declare a
cluster member inoperative because of packet losses which
are inherent in any TCP/IP network. This monitoring process
is done in the preferred embodiment by a method whereby
each non-master cluster member keeps a "master watchdog
timer" and the master keeps a "client watchdog timer" for all
cluster members. These watchdog timers are merely routines
whereby the cluster member's OS periodically checks a
"watchdog time-value" to see if it is more than "t" time
earlier than the current time (that is, to see if the watchdog
time value has been reset within the last "t" time). If the
routine finds that the difference between the current time and
the watchdog time value is greater than "t" time, then it
declares the cluster member related to the watchdog timer to
be inoperative. These watchdog time values are reset
whenever a cluster member sends a "keepalive" packet
(sometimes called a "heartbeat" message) to the other
members.
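The watchdog routine described above amounts to comparing the current time with the stored watchdog time-value. A minimal sketch follows, assuming whole-second resolution; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical watchdog record kept for one monitored cluster member. */
struct watchdog {
    time_t last_reset;   /* the "watchdog time-value" */
    double timeout_s;    /* the allowed interval "t", in seconds */
};

/* Called whenever a keepalive packet arrives from the monitored member. */
static void watchdog_reset(struct watchdog *w, time_t now)
{
    assert(w != NULL);
    w->last_reset = now;
}

/*
 * Called periodically. Returns true when the time-value is more than "t"
 * earlier than the current time, meaning the member is declared inoperative.
 */
static bool watchdog_expired(const struct watchdog *w, time_t now)
{
    assert(w != NULL);
    return difftime(now, w->last_reset) > w->timeout_s;
}
```

The master would keep one such record per client, and each non-master member would keep a single record for the master.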

Generally a "keepalive" message is a message sent by one
network device to inform another network device that the
virtual circuit between the two is still active. In the preferred
embodiment the master unit sends a "master keepalive"
packet that contains a list of the cluster members, an
"adaptive keepalive interval" and a current set of work
assignments for all members. The non-master cluster
members monitor a master watchdog timer to make sure the
master is still alive and use the "adaptive keepalive interval"
value supplied by the master to determine how frequently
they (the non-master cluster members) must send their
"client keepalive" packets so that the master can monitor
their presence in the cluster. The "client keepalive" packets
contain a monotonically increasing sequence number which
is used to measure packet loss in the system and to adjust the
probability of packet loss value, which in turn is used to
adjust the adaptive keepalive interval. Generally these
calculations are done as follows in the preferred embodiment;
however, it will be understood by those skilled in these arts
that various programming and logical circuit processes may
be used to accomplish equivalent measures of packet loss and
related watchdog timer values.

Each client includes a sequence number in its "client 
keepalive" packet. When the master gets this keepalive 
packet for client "x" he makes the following calculations: 

S_x = [this sequence number] - [last sequence number] - 1

This value S_x is typically 0 or 1 and represents the
number of dropped packets between the last two keepalive
messages, or the current packet loss for client "x".
This value is then used in an exponential smoothing
formula to calculate the current average packet loss "P".
P then represents the probability of a lost packet, and
P^n (P to the nth power) would represent the probability of
getting "n" successive packet losses. And 1/P^n would be how
often we would lose "n" packets in a row.
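The packet loss measurement can be sketched as follows. The computation of S_x follows the text; the smoothing step uses the standard exponential smoothing form P_new = (1 - alpha) * P_old + alpha * S_x, in which the weight alpha is an assumed parameter and not a value taken from the preferred embodiment. The structure and function names are likewise hypothetical, and the sketch assumes that keepalives arrive in order without sequence number wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-client record kept by the master. */
struct loss_tracker {
    uint32_t last_seq;  /* last sequence number seen from this client */
    double   p;         /* smoothed packet loss estimate "P" */
    double   alpha;     /* smoothing weight in (0, 1]; an assumed parameter */
};

/* S_x = [this sequence number] - [last sequence number] - 1 */
static uint32_t dropped_between(uint32_t this_seq, uint32_t last_seq)
{
    assert(this_seq > last_seq);   /* in-order arrival is assumed */
    return this_seq - last_seq - 1u;
}

/* Fold the newest loss sample into the running estimate and return it. */
static double loss_update(struct loss_tracker *t, uint32_t this_seq)
{
    uint32_t s;

    assert(t != NULL && t->alpha > 0.0 && t->alpha <= 1.0);
    s = dropped_between(this_seq, t->last_seq);
    t->p = (1.0 - t->alpha) * t->p + t->alpha * (double)s;
    t->last_seq = this_seq;
    return t->p;
}
```

A small alpha makes the estimate respond slowly to isolated losses, which suits the stated goal of not declaring a member inoperative because of ordinary network packet loss.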

So "n" is defined as the number of lost packets per
interval, and P^n then is the probability of losing "n" packets
in an interval. Obviously, if we lose more than some number
of packets in a given interval, the cluster member is either
malfunctioning or inoperative, or the network is having
problems. In the preferred embodiment we assume "n" is a
number between 2 and 20 and calculate its value adaptively
as follows.

We call the interval "K" and set 1/K = n P^n. By policy we
set K = 3600 (which is equivalent to a period of 1 week) and
then calculate the smallest integer value of "n" for which
n P^n < 1/3600. In the preferred embodiment this is done by
beginning the calculation with n = 2 and increasing n by 1
iteratively until the condition is met. The resulting value of
"n" is the adaptive keepalive interval which the master then
sends to all of the cluster members to use in determining
how often they are to send their "client keepalive" messages.
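The iterative search for "n" can be sketched as follows. The function name and the choice to cap the result at 20 when the condition is never met are assumptions; the text states only that "n" is assumed to lie between 2 and 20.

```c
#include <assert.h>

#define N_MIN 2    /* the search begins with n = 2 */
#define N_MAX 20   /* assumed cap, from the stated range of 2 to 20 */

/*
 * Returns the smallest integer n, starting from N_MIN, for which
 * n * P^n < 1/K, or N_MAX if no smaller n satisfies the condition.
 */
static int adaptive_keepalive_n(double p, double k)
{
    double p_to_n;
    int n;

    assert(p >= 0.0 && p < 1.0 && k > 0.0);
    p_to_n = p * p;                       /* P^2 for the first iteration */
    for (n = N_MIN; n < N_MAX; n++) {
        if ((double)n * p_to_n < 1.0 / k)
            return n;
        p_to_n *= p;                      /* advance to P^(n+1) */
    }
    return N_MAX;
}
```

With K = 3600, a loss probability of 0.1 yields n = 5 and a loss probability of 0.5 yields n = 16, so a lossier network leads the master to tolerate more consecutive missed keepalives.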

Having described the invention in terms of a preferred
embodiment, it will be recognized by those skilled in the art
that various types of general purpose computer hardware
may be substituted for the configuration described above to
achieve an equivalent result. Similarly, it will be appreciated
that arithmetic logic circuits are configured to perform each
required means in the claims for processing internet security
protocols and tunneling protocols; for permitting the master
unit to adaptively distribute processing assignments for
incoming messages and for permitting cluster members to
recognize which messages are theirs to process; and for
recognizing messages from other members in the cluster. It
will be apparent to those skilled in the art that modifications
and variations of the preferred embodiment are possible,
which fall within the true spirit and scope of the invention
as measured by the following claims.

What is claimed is: 

1. An Internet Protocol (IP) Network cluster apparatus 
comprising: 

a. a plurality of cluster members with all cluster members
being addressable by a single dedicated Internet
machine name and IP address for the cluster, each
cluster member comprising a computer system having
a processor, a memory, a program in said memory, a
display screen and an input/output unit;

b. a filter mechanism in each cluster member, the filter
mechanism using a hashing mechanism to generate an
index number for each message session received by the
cluster member, the index number being used to indicate
to which workset a message belongs, worksets
being assigned to cluster members to balance processing
load, each cluster member checking whether the
workset has been assigned to it in order to determine
whether the cluster member must process the message
received or ignore it.

2. The apparatus of claim 1 further comprising an assignment
mechanism in each cluster member, for use by a cluster
member designated as a master unit, the assignment
mechanism used when a message of an unassigned message
session is received by the master unit, the assignment
mechanism using the index number calculated by the filter
mechanism to assign sets of message sessions to cluster
members for further processing in order to load balance
processing of incoming messages.

3. The apparatus of claim 1 further comprising a first 
program code mechanism in each of the plurality of cluster 
members configured to save state for each message session 
including TCP state.

4. The apparatus of claim 3 further comprising a second
program code mechanism in each of the plurality of cluster
members configured to transfer an essential portion of the
saved state for each message session to each of the other
cluster members, whenever required.

5. The apparatus of claim 4 further comprising a third 
program code mechanism in each of the plurality of cluster 
members configured to permit a cluster member acting as a
master unit to recognize an equipment failure in one of the 
other members in the cluster, to reassign the work of the 
failed cluster member to remaining members in the cluster 
thereby rebalancing the processing load and maintaining the 
message sessions. 

6. The apparatus of claim 5 further comprising a fourth 
program code mechanism in each of the plurality of cluster 
members configured to permit units which are not acting as 
the master unit to recognize an equipment failure in the 
master unit, to immediately and cooperatively designate one 
of the remaining cluster members as a new master unit, the 
new master unit to reassign the work of the failed cluster 
member to remaining cluster members thereby rebalancing 
the processing load and maintaining the message sessions. 

7. The apparatus of claim 1 wherein the memory of each 
of the cluster members includes a flash memory card con- 
taining a program code mechanism which describes the 
personality of the cluster member including its cluster 
address. 

8. A method for operating a plurality of computers in an 
Internet Protocol (IP) Network cluster, the cluster providing 
a single-system-image to network users, the method
comprising the steps of:

a. providing a plurality of cluster members, each cluster 
member comprising a computer system having a 
processor, a memory, a program in said memory, a 
display screen and an input/output unit; 

b. interconnecting the cluster members together, and 
assigning all cluster members a same internet machine 
name and a same IP address whereby a message 
arriving at the cluster will be recognized by the appro- 
priate member in the cluster and an output from any 
cluster member will be recognized as coming from the 
cluster, and whereby the cluster members can commu- 
nicate with each other; and 

c. providing a filter mechanism in each cluster member, 
the filter mechanism using a hashing mechanism to 
generate an index number for each message session 
received by the cluster member, the index number 
being used to indicate to which workset a message 
belongs, worksets being assigned to cluster members to 
balance processing load, each cluster member checking 
whether the workset has been assigned to it in order to 
determine whether the cluster member must process the 
message received or ignore it. 

9. The method of claim 8 further comprising an assign- 
ment mechanism in each cluster member, for use by a cluster 
member designated as a master unit, the assignment mecha- 
nism used when a message of an unassigned message 
session is received by the master unit, the assignment 
mechanism using the index number calculated by the filter 
mechanism to assign sets of message sessions to cluster 
members for further processing in order to load balance
processing of incoming messages.

10. The method of claim 8 comprising the additional step
of each cluster member saving state for each message
session connection including TCP state, and for segregating
this state into an essential state portion and a non-essential
state portion.

11. The method of claim 10 comprising the additional step
of each cluster member transferring to each other cluster
member the saved essential state portion for message
sessions for which that cluster member is responsible, such
transfer to be made whenever the essential portion of the
state changes, whereby all cluster members maintain
essential state for all message session connections.

12. The method of claim [illegible] comprising the additional step
of each cluster member recognizing the equipment failure of
one of the cluster members, immediately reassigning a task
of being the master unit if it is the master [illegible]
master unit reassigning the work which was assigned to the
failed cluster member [illegible]
tunnel-servers.

13. An Internet Protocol (IP) network cluster apparatus
comprising:

a. a plurality of interconnected cluster members, each
cluster member comprising a computer system having
a processor, a memory, a program in said memory, a
display screen and an input/output unit;

b. means in each of the plurality of cluster members for
recognizing other members of the plurality of cluster
members which are connected together and cooperating
with the other members to adaptively designate a
master unit; and

c. means for generating an index number for each message
session received by a cluster member, the index
number being used to indicate whether the cluster
member must process the message received or ignore
it.

14. The apparatus of claim 13 further comprising means
in each of the plurality of cluster members for saving 
essential state for each message session. 

15. The apparatus in claim 14 further comprising means 
in each of the plurality of cluster members for periodically 
transferring the saved essential state for each message
session to each of the other members in the cluster. 

16. The apparatus of claim 15 further comprising means
in each of the plurality of cluster members for permitting a
cluster member acting as a master unit to recognize an
equipment failure in one of the other cluster members, and
for reassigning work of the failed cluster member to remaining
members in the cluster thereby rebalancing the processing
load and maintaining message session connections, and
for permitting cluster members which are not acting as a
master unit to recognize an equipment failure in the master
unit, to immediately and cooperatively designate one of the
remaining cluster members as a new master unit, the new
master unit to reassign work of the failed cluster member to
remaining members in the cluster thereby rebalancing the
processing load and maintaining message session
connections.